YouTube is one of the largest video streaming platforms in the world, yet the way its algorithm handles content remains quite opaque. Most of us have wondered, while browsing YouTube, why certain videos make the trending list while others do not. This question is especially important for YouTube Creators, the individuals who produce content for the platform. This unique model empowers Creators to earn money through placed advertising, merchandise sales, and subscriptions.
In this project, we try to understand how YouTube ranks its trending videos. We also aim to give Content Creators insights into the kind of content that increases the probability of a video trending and, in turn, generates more revenue.
We try to address the above conundrum by analyzing the following questions:
How do geography, culture, globalization, and the economy impact trending videos?
Watch out! We could be giving you the secret sauce to make your next video go viral.
The data used in this analysis was extracted from YouTube via the YouTube API for 11 countries.
The script was run on 9th November 2021 to generate the trending-video data.
It consists of approximately 2,200 rows combined across the 11 countries (US, Brazil, India, Russia, Canada, Great Britain, France, Germany, Korea, Japan, and Mexico) and 20 columns: video_id, title, publishedAt, channelId, channelTitle, categoryId, trending_date, tags, view_count, likes, dislikes, comment_count, thumbnail_link, comments_disabled, ratings_disabled, description, videoresolution, videoduration, videoprojected, and videoliscensed.
We combined the data points from all countries into a single sheet and added a country column to identify which country each video originated from. We sourced the data using the Google API client. The script is self-sufficient and can generate the CSV files into an output folder.
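The combine step described above can be sketched on two hypothetical mini-frames (the column values here are made up for illustration):

```python
import pandas as pd

# Hypothetical per-country frames standing in for the real scraped data
us = pd.DataFrame({"video_id": ["a1", "a2"], "view_count": [100, 200]})
in_ = pd.DataFrame({"video_id": ["b1"], "view_count": [300]})

# Tag each frame with its country before stacking them
us["Country Code"] = "US"
in_["Country Code"] = "IN"

# Stack the per-country frames into one combined table
combined = pd.concat([us, in_], ignore_index=True)
print(combined.shape)  # (3, 3)
```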
Although we spent considerable effort on data acquisition (learning to use the YouTube API calls and working through the content-heavy documentation), the project is analysis-heavy: our main aim was to extract useful insights for content creators by answering the questions above. To support the analysis, we have produced six visualizations.
For the YouTube analysis, we require the following libraries, so we start by importing them.
To use the Google API client for YouTube, run the following in a command prompt (Windows) or terminal (macOS/Linux):
pip install google-api-python-client
Additionally, since we are using VADER for sentiment analysis, also run:
pip install vaderSentiment
Note: We are only using the data we fetched on 9th November, so we do not need to re-run the Get API code.
import requests, sys, time, os, argparse
import numpy as np
import pandas as pd
import datetime as dt
import re
from collections import Counter
from os import path
from PIL import Image
from dateutil.parser import parse
from googleapiclient.discovery import build
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.probability import FreqDist
from nltk.text import Text
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
from nltk import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
#nltk.download()
import nltk
from textblob import TextBlob
from nltk.tokenize import RegexpTokenizer
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import imageio
from IPython.display import Image, HTML
After creating a Google account, we have to create a Google API key for our own use. To keep the API key secure, we do not use it directly in the code.
Instead, we create a bin file, store the API key inside it, and then read the key from that file in the code.
For ease of access, we have added the token key for your reference.
api_key = ""
#Use below code for securing the api_key
# get the api_file.bin which contains the api key
# with open("api_file.bin",encoding="utf-8") as binary_file:
# apkey = binary_file.read()
# api_key = str(apkey)
# We use the build function to create a service object. It takes the API name and version as arguments, along with the API key
yt = build('youtube','v3',developerKey=api_key)
Here our aim is to fetch specific features from the API for a set of countries, which can be changed as required. We combine the feature names into a header variable so that we can later add column names to the dataframes/CSV files, whichever we choose.
# the max rows returned by the api is 50 so we set the variable to 50
max_res = 50
country_codes =['US','GB','IN','DE','CA','FR','KR','RU','JP','BR','MX']
output_dir = 'output/'
features_snippet = ["title","publishedAt","channelId","channelTitle","categoryId"]
header = ["video_id"] + features_snippet + ["trending_date", "tags", "view_count", "likes", "dislikes",
"comment_count", "thumbnail_link", "comments_disabled",
"ratings_disabled", "description","videoresolution","videoduration","videoprojected","videoliscensed"]
After some trial and error, we found that characters like \n and quotes can later cause issues in the code, so we create a function to strip them.
YouTube videos can carry many tags, so we join the tag list into a single string.
remove_characters = ['\n', '"']
def clean_basicfeatures(feature):
'''Cleans basic features like \n and quotes'''
for ch in remove_characters:
feature = str(feature).replace(ch, "")
return f'"{feature}"'
def return_video_tags(tags_list):
return clean_basicfeatures("-".join(tags_list))
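A quick standalone check of the cleaning helpers (they are restated here so the snippet runs on its own):

```python
# Restatement of the two cleaning helpers defined above
remove_characters = ['\n', '"']

def clean_basicfeatures(feature):
    '''Strip newlines and quotes, then wrap the value in quotes'''
    for ch in remove_characters:
        feature = str(feature).replace(ch, "")
    return f'"{feature}"'

def return_video_tags(tags_list):
    '''Join a tag list with dashes and clean the result'''
    return clean_basicfeatures("-".join(tags_list))

print(clean_basicfeatures('line1\nline2'))   # "line1line2"
print(return_video_tags(["funny", "cats"]))  # "funny-cats"
```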
We first create a function that requests the most popular videos on YouTube for a given country code.
Because the API returns at most 50 rows per request, we read the nextPageToken from each response and re-issue the request with it in order to fetch the next page of data. Below are the steps needed.
def get_response(next_page_token = "",countrycode="US"):
'''Gets the API response for the most popular videos in a region, given the country code and next page token'''
vids_request = yt.videos().list(part="snippet,contentDetails,statistics",chart="mostPopular",regionCode=countrycode,maxResults=max_res,pageToken=next_page_token)
vids_response = vids_request.execute()
return vids_response
def get_videos_items(video_response):
'''Get the list of videos and features from each video response per country code and return a CSV line to be written to a file'''
csv_data =[]
for item in video_response:
comments_disabled = False
ratings_disabled = False
#Since the response consists of a list of dictionaries, we iterate over each element in the list (a dictionary)
#Hence we can fetch the respective features by using the key name
video_id = item["id"]
snippet_dict = item["snippet"]
stats_dict = item["statistics"]
contentDetail_dict = item["contentDetails"]
vid_res = contentDetail_dict.get("definition","")
vid_duration = contentDetail_dict.get("duration",0)
vid_project = contentDetail_dict.get("projection","")
vid_lisc = str(contentDetail_dict.get("licensedContent",""))
#Running the clean features function for all features
snip_feature = [ clean_basicfeatures(snippet_dict.get(feature,"")) for feature in features_snippet]
vid_desc = snippet_dict.get("description","")
thumbnail_jpg = snippet_dict.get("thumbnails",dict()).get("default",dict()).get("url", "")
#Also storing the date at which the data was scraped, for analyzing the trend times of the videos
scrapped_date = dt.datetime.now().strftime('%Y-%m-%dT%H:%M:%S.%fZ')
video_tags = return_video_tags(snippet_dict.get("tags",["[none]"]))
view_count = stats_dict.get("viewCount",0)
if 'likeCount' in stats_dict and 'dislikeCount' in stats_dict:
likes_count = stats_dict['likeCount']
dislikes_count = stats_dict['dislikeCount']
else:
ratings_disabled = True
likes_count = 0
dislikes_count = 0
if 'commentCount' in stats_dict:
comment_count = stats_dict['commentCount']
else:
comments_disabled = True
comment_count = 0
#Join all the fields into a comma-separated line to store in the CSV file
#Note: we don't strictly have to store a .csv; we could use a dataframe directly, but keeping a .csv helps track all the scraped data and is easy to reuse
csv_dt_line = [video_id] + snip_feature + [clean_basicfeatures(item) for item in [scrapped_date, video_tags, view_count, likes_count, dislikes_count,comment_count, thumbnail_jpg, comments_disabled,ratings_disabled, vid_desc]] + [vid_res] + [vid_duration] + [vid_project] + [vid_lisc]
csv_data.append(",".join(csv_dt_line))
return csv_data
def get_youtube_pages(country_code="US"):
'''Get all the videos for a country until the next page token is None, i.e. we have reached the last page'''
countrywise_data = []
next_page_tk=""
while next_page_tk is not None:
response_page = get_response(next_page_token = next_page_tk, countrycode = country_code)
next_page_tk = response_page.get("nextPageToken",None)
itemslist = response_page.get('items', [])
countrywise_data += get_videos_items(itemslist)
return countrywise_data
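The nextPageToken loop above can be illustrated with a toy stand-in for the API, where fake pages are keyed by their page token (all names and values here are hypothetical, not real API responses):

```python
# Fake "API": each page maps its token to items plus (optionally) the next token
fake_pages = {
    "": {"items": [1, 2], "nextPageToken": "p2"},
    "p2": {"items": [3], "nextPageToken": "p3"},
    "p3": {"items": [4, 5]},  # no nextPageToken -> last page
}

def fetch_all(pages):
    '''Follow nextPageToken until it is missing, collecting items along the way'''
    collected = []
    token = ""
    while token is not None:
        page = pages[token]
        collected += page.get("items", [])
        token = page.get("nextPageToken", None)
    return collected

print(fetch_all(fake_pages))  # [1, 2, 3, 4, 5]
```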
def write_csv(country_code,countrywise_data):
'''Writing the data to a csv file in the output folder of the current directory'''
print(f"Inserting {country_code} data to file")
#create path if not already existing
if not os.path.exists(output_dir):
os.makedirs(output_dir)
#write to each file named according to the time fetched and country code
with open(f"{output_dir}/{dt.datetime.now().strftime('%y.%d.%m')}_{country_code}_video.csv","w+",encoding="utf-8") as file:
for row in countrywise_data:
file.write(f"{row}\n")
def get_youtube_data():
'''Main function to start fetching all popular videos per country and write to the csv file'''
for country in country_codes:
country_dt = [",".join(header)] + get_youtube_pages(country)
write_csv(country,country_dt)
# Start fetching the data and writing to the csv files
# To be commented once fetched and need to use the existing files in the local directory
# get_youtube_data()
We noticed that the categories returned for each video were numeric IDs, and since we needed to relate each category number to its meaning, we run another API request to fetch the mapping. Below are the steps.
# Fetch the category-wise data for each country.
# The mapping is largely the same everywhere, but some categories may be unavailable in certain countries, hence we fetch it per country
vid_cat_dat_country = {}
for code in country_codes:
    vid_cat_dat = {}  # reset per country so mappings do not leak between countries
    video_category_datamapping = yt.videoCategories().list(part="snippet",regionCode=code)
    video_category_datamapping_res = video_category_datamapping.execute()
    for item in video_category_datamapping_res["items"]:
        vid_cat_dat[int(item["id"])] = item["snippet"]["title"]
    vid_cat_dat_country[code] = vid_cat_dat
video_categories_mapping = vid_cat_dat_country["US"]
video_categories_mapping
After fetching the data, we observed that it required some cleaning. Below are the steps required.
#Loading the data for various countries (US, Brazil, Canada, France, etc.) to clean the data and perform analysis on it; a Country Code column will track which data belongs to which country
us_data = pd.read_csv(r'output/21.09.11_US_video.csv')
br_data = pd.read_csv(r'output/21.09.11_BR_video.csv')
ca_data = pd.read_csv(r'output/21.09.11_CA_video.csv')
de_data = pd.read_csv(r'output/21.09.11_DE_video.csv')
fr_data = pd.read_csv(r'output/21.09.11_FR_video.csv')
gb_data = pd.read_csv(r'output/21.09.11_GB_video.csv')
in_data = pd.read_csv(r'output/21.09.11_IN_video.csv')
jp_data = pd.read_csv(r'output/21.09.11_JP_video.csv')
kr_data = pd.read_csv(r'output/21.09.11_KR_video.csv')
mx_data = pd.read_csv(r'output/21.09.11_MX_video.csv')
ru_data = pd.read_csv(r'output/21.09.11_RU_video.csv')
us_data.head()
#A few videos have no tags; the API populates these as [none], so we replace them with np.nan to ensure consistent analysis across countries
for df in [us_data, br_data, ca_data, de_data, fr_data, gb_data, in_data, jp_data, kr_data, mx_data, ru_data]:
    df['tags'] = df['tags'].replace('[none]', np.nan)
#Inserting a new column Country Code to identify which country the data belongs to
column_new = 'Country Code'
for df, code in zip([us_data, br_data, ca_data, de_data, fr_data, gb_data, in_data, jp_data, kr_data, mx_data, ru_data],
                    ['US', 'BR', 'CA', 'DE', 'FR', 'GB', 'IN', 'JP', 'KR', 'MX', 'RU']):
    df[column_new] = code
us_data.head()
#The change_time function changes the data type of publishedAt and trending_date from String to Date Time Object
def change_time(df1):
'''Taking a dataframe and parsing the date from String to Date time Object'''
df1['publishedAt'] = [parse(d) for d in df1['publishedAt']]
df1['trending_date'] = [parse(x) for x in df1['trending_date']]
df1['trendDuration'] = df1['trending_date']-df1['publishedAt']
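As a quick sanity check, the change_time logic can be run on a hypothetical two-row frame (restated inline so the snippet is self-contained):

```python
import pandas as pd
from dateutil.parser import parse

# Made-up timestamps: published 8 and 4 days before trending, respectively
toy = pd.DataFrame({
    "publishedAt": ["2021-11-01T12:00:00Z", "2021-11-05T06:00:00Z"],
    "trending_date": ["2021-11-09T12:00:00Z", "2021-11-09T06:00:00Z"],
})
# Same steps as change_time: parse strings, then subtract to get the trend duration
toy["publishedAt"] = [parse(d) for d in toy["publishedAt"]]
toy["trending_date"] = [parse(x) for x in toy["trending_date"]]
toy["trendDuration"] = toy["trending_date"] - toy["publishedAt"]
print([td.days for td in toy["trendDuration"]])  # [8, 4]
```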
for df in [us_data, br_data, ca_data, de_data, fr_data, gb_data, in_data, mx_data, jp_data, kr_data, ru_data]:
    change_time(df)
gb_data.head()
To convert the ISO 8601 duration to seconds, we follow the steps below. First, we replace the 'PT' prefix with an empty string in each dataframe's videoduration column:
for df in [us_data, br_data, ca_data, de_data, fr_data, gb_data, in_data, jp_data, kr_data, mx_data, ru_data]:
    df['videoduration'] = df['videoduration'].replace({'PT': ''}, regex=True)
def change_second(duration):
    '''Take a duration string like '1H2M3S' and return its total number of seconds'''
    hour_conversion = 3600
    second_conversion = 60
    hour = 0
    minutes = 0
    sec = 0
    value = ''
    # iterate over each character, accumulating digits until a unit letter is hit
    for x in duration:
        if x.isdigit():
            value += x
            continue
        elif x == 'H':
            hour = int(value) * hour_conversion
        elif x == 'M':
            minutes = int(value) * second_conversion
        elif x == 'S':
            sec = int(value)
        value = ''
    return hour + minutes + sec
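As an independent cross-check of the conversion above, the same 'PT'-stripped duration strings can be parsed with a regular expression (a sketch, not part of the original pipeline):

```python
import re

def iso_duration_seconds(duration):
    '''Parse '1H2M3S'-style strings (the 'PT' prefix already stripped) into seconds'''
    match = re.fullmatch(r'(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?', duration)
    # Missing groups (e.g. no hours part) count as zero
    h, m, s = (int(g) if g else 0 for g in match.groups())
    return h * 3600 + m * 60 + s

print(iso_duration_seconds('1H2M3S'))  # 3723
print(iso_duration_seconds('45S'))     # 45
```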
for df in [us_data, br_data, ca_data, de_data, fr_data, gb_data, in_data, jp_data, kr_data, mx_data, ru_data]:
    df['videoduration'] = df['videoduration'].apply(lambda x: change_second(x) if not pd.isna(x) else x)
gb_data.head()
We use the pandas concat function to concatenate all the dataframes into one.
database_combined = pd.concat([us_data,br_data,ca_data,de_data,fr_data,gb_data,in_data,jp_data,kr_data,mx_data,ru_data],ignore_index=True)
database_combined.head()
This section explores the correlation between the view_count, likes, dislikes, comment_count, and videoduration columns.
correlation_data=database_combined[['view_count','likes','dislikes','comment_count','videoduration']]
plt.figure(figsize=(15,10))
sns.heatmap(correlation_data.corr(),annot=True, cmap='rocket')
plt.show()
Observation:
From the correlation matrix above, we observe that videos with more likes also tend to have more views; the same pattern of increased views holds as the dislike and comment counts rise.
We also observe that as video duration increases, fewer viewers engage with the video, and the numbers of likes and dislikes are negatively impacted as well.
Inferences:
A video begins to trend as more people watch it, though the view-count threshold for trending is relative. Liking, disliking, and commenting on a video all increase activity and reach to potential viewers, so as these interactions increase, the video's reach increases with them.
Psychologists suggest that the average human attention span is about 20 minutes; for online videos, it appears to be closer to 60 seconds. Since most users prefer to scroll YouTube leisurely, lengthy videos can hamper their experience: they demand sustained attention and require remembering earlier material to make sense of the content. Apart from this, as the famous adage goes, 'Time is Money': most viewers want to gain the most information in the minimum amount of time. For example, students watching informational videos on YouTube prefer short, crisp videos that let them understand the concepts well in a short period.
Here, we analyze the ideal time to publish a video based on the column publishedAt from the database.
time_series_youtube_df = database_combined.set_index("publishedAt")
#get the hours of when its published
time_series_youtube_df["hour_publishedAt"] = time_series_youtube_df.index.hour
time_series_df = pd.DataFrame(time_series_youtube_df["hour_publishedAt"].value_counts()).sort_index()
#colour the top 5 values in descending order
red_clrs = time_series_df.sort_values(by="hour_publishedAt",ascending=False)[:5].index.tolist()
clrs = ["grey" if x not in red_clrs else "red" for x in time_series_df.index]
plt.figure(figsize=(15,10))
#plot the barchart with coloured bars
ax = sns.barplot(x=time_series_df.index,y=time_series_df["hour_publishedAt"], data=time_series_df, palette=clrs)
plt.ylabel("Video Count",fontsize=15)
plt.xlabel("Upload Time",fontsize=15)
plt.title("Video Uploading Time",fontsize=20)
plt.show()
Observation: From this, we can see that the most common (and likely ideal) times to upload a video are around 9 am or in the evening between 3 pm and 6 pm.
Inference: Videos uploaded in the evening tend to trend more than at any other time, presumably because the target audience is back home from school or work by then and wants something relaxing to watch at the end of the day. The 9:00 am anomaly is likely due to videos uploaded for audiences in other countries, primarily India, where it would be around 7:00 pm at that time.
#selecting english speaking countries
country_filter = ['US','GB','CA']
database_combined[database_combined["Country Code"].isin(country_filter)]
#Creating DataFrame for Sentiment Analysis for Tag
sentiment_polarity_df = database_combined[(database_combined["Country Code"].isin(country_filter))&(database_combined["tags"].notnull())]
def senti(x):
'''This function returns the compound score using positive negative and neutral polarities'''
return SentimentIntensityAnalyzer().polarity_scores(x)["compound"]
#storing the stop words to remove later
stopwords_english = list(stopwords.words('english'))
# Lemmatizataion
lmtzr = WordNetLemmatizer()
categorywise_polarity_total = []
for item in video_categories_mapping:
    tagged_words = sentiment_polarity_df[sentiment_polarity_df["categoryId"]==item]["tags"].str.lower().str.cat(sep=' ')
    token_words = word_tokenize(re.sub('[^A-Za-z]+',' ',tagged_words))
    tagged_words_only = np.array([lmtzr.lemmatize(word) for word in token_words if((word not in stopwords_english)&(not word.isdigit()))])
    filtered_words = filter(lambda x : len(x)>2,tagged_words_only)
    scores_total = pd.DataFrame(FreqDist(Text(list(filtered_words))).most_common(1000),columns=["Word","Freq"])["Word"].map(senti).sum()
    categorywise_polarity_total.append(scores_total)
categorywise_polarity_total_df = pd.DataFrame(categorywise_polarity_total)
category_name_df = pd.DataFrame(video_categories_mapping.values())
final_category_tags_df = pd.concat([category_name_df,categorywise_polarity_total_df],axis=1)
final_category_tags_df.columns=["Categories","Polarity"]
final_category_tags_df_new = final_category_tags_df[final_category_tags_df["Polarity"]!=0].sort_values(by="Polarity",ascending=False)
plt.figure(figsize=(15,10))
ax = sns.barplot(x=final_category_tags_df_new["Polarity"],y=final_category_tags_df_new["Categories"], data=final_category_tags_df_new,orient="h")
plt.ylabel("Categories of Videos")
plt.xlabel("Polarity of Tags")
plt.title("Youtube Categorywise Tags Polarity")
plt.show()
video_categories_mapping
final_category_tags_df_new
Observation: We can see that categories like Entertainment, Sports, Howto & Style, and Music have high positive polarity scores, while categories like Gaming, Travel & Events, Film & Animation, and News & Politics have low or negative polarity scores.
Inference: If we want a video to trend in categories like Entertainment, Howto & Style, or Music, we should use tag words with positive polarity; conversely, for categories like Gaming, Travel & Events, Film & Animation, and News & Politics, a YouTuber can use negative-polarity words. Since news and politics frequently cover negative events, their videos generally trend with negative polarity. It is also not surprising to see Gaming trend negatively, since gaming content often relies on strongly negative-polarity words such as "pranks" or "battle".
To further analyze what Title words are generally trending, we can create a word cloud and check the most common terms that pop up.
# masking YouTube's logo onto the wordcloud for the theme
mask_image = imageio.imread('youtube_logo.png')
tagged_words = database_combined[database_combined["categoryId"]==26]["tags"].str.lower().str.cat(sep=' ')
token_words = word_tokenize(re.sub('[^A-Za-z]+',' ',tagged_words))
tagged_words_only = np.array([lmtzr.lemmatize(word) for word in token_words if((word not in stopwords_english)&(not word.isdigit()))])
filtered_words = filter(lambda x : len(x)>2,tagged_words_only)
filtered_words = list(filtered_words)
notuseful_words = ["official","music","video","short","highlights","highlight","trailer","audio","oficial"]
new_words = list(filter(lambda w: w not in notuseful_words, filtered_words))
plt.figure(figsize=(20,15))
title_cloud = WordCloud(colormap='prism',mask=mask_image,background_color="white")
title_cloud = title_cloud.generate(' '.join(new_words))
plt.imshow(title_cloud,interpolation="bilinear")
plt.axis("off")
plt.show()
Observation:
Apart from generic words like official, video, and music, we see Travis Scott come up many times, along with Mariah Carey, whose song trends more as Christmas approaches, and Bruno Mars, who has released a new song with Silk Sonic. Travis Scott appears so often not just because of his music, but unfortunately also because of the incident at his concert, which drove the trend.
Here we look at the top videos in our dataset overall. Since we also captured the thumbnail link, we decided to display it.
def path_to_image_html(path):
    '''
    Convert an image url into an '<img src=.../>' tag; the inline style
    can be adjusted to control the height, aspect ratio, size, etc.
    '''
    return '<img src="' + path + '" style="max-height:124px;"/>'
#get the top 10 video based on frequency
top_videos_df = database_combined[database_combined["video_id"].isin(database_combined["video_id"].value_counts()[:10].index.tolist())][["thumbnail_link","title","channelTitle","video_id","view_count","categoryId"]]
top_videos_df = top_videos_df.astype({"video_id": str})
top_videos_df = top_videos_df.drop_duplicates(subset = ["video_id"]).reset_index(drop=True)
top_videos_df["CategoryName"] = top_videos_df["categoryId"].map(video_categories_mapping)
#show the dataframe with thumbnail
HTML(top_videos_df[["thumbnail_link","title","channelTitle","view_count","CategoryName"]].to_html(index=False,escape=False ,formatters=dict(thumbnail_link=path_to_image_html)))
Observation:
We noticed that most YouTube viewers like to watch music content by Bruno Mars, Travis Scott, and Mariah Carey (her Christmas song). They also watch gaming content such as Ozymandias and the ELDEN RING Gameplay Preview.
#determine the top viewed videos overall
top_by_most_views = database_combined.sort_values(by=['view_count'], ascending=False)
top_by_most_views = top_by_most_views[0:10]
top_by_most_views.head()
#Determine the videos that were trending the longest
top_by_trend_duration = database_combined.sort_values(by=['trendDuration'], ascending=False)
top_by_trend_duration = top_by_trend_duration[0:10]
top_by_trend_duration.head(10)
# Pie chart
top_by_trend_duration_sum = top_by_trend_duration.groupby('Country Code')['trendDuration'].sum()
sizes = top_by_trend_duration_sum.values.tolist()
labels = 'Canada', 'Germany','Britain', 'South Korea', 'United States'
#colors
colors = ['#ff1119','#66b3ff','#99ff99','#ffcc99','#B366FF']
#explosion (offset each wedge slightly)
explode = (0.05,0.05,0.05,0.05,0.05)
plt.figure(figsize=(15,10))
plt.pie(sizes, colors = colors, labels=labels, autopct='%1.1f%%', startangle=90, pctdistance=0.85, explode = explode,textprops={'fontsize': 15})
#draw circle
centre_circle = plt.Circle((0,0),0.70,fc='white')
fig = plt.gcf()
fig.gca().add_artist(centre_circle)
plt.tight_layout()
plt.title("The Proportion of Oldest trending video by Country",fontsize=20)
plt.show()
Here we try to find any relevant trends in different categories by region.
def get_top_category_row(df, country_code, cat_name_map, n=5):
'''Return the list of top n trends for each country'''
# grouping by categoryid to find trends in videos specific to category and returning top n rows
result = (
df.groupby('categoryId', as_index=False).size().sort_values(by='size', ascending=False).head(n)
)
#mapping the category id and category name so as to make sense what the ids actually mean
result['categoryName'] = result['categoryId'].map(cat_name_map)
return [country_code, *result['categoryName']]
test_list=[]
# Iterate through each country code and get the top 5 categories for each into a dataframe
for count_code in country_codes:
test_list_itm = get_top_category_row(database_combined[database_combined["Country Code"]== count_code],count_code,video_categories_mapping)
test_list.append(test_list_itm)
top_df = pd.DataFrame(test_list,columns=['Country', 'Top1', 'Top2', 'Top3','Top4', 'Top5'])
top_df
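The groupby-and-map logic inside get_top_category_row can be demonstrated on a toy frame with a hypothetical category mapping:

```python
import pandas as pd

# Toy frame: category 10 appears three times, 24 twice, 20 once
toy = pd.DataFrame({"categoryId": [10, 10, 10, 24, 24, 20]})
cat_names = {10: "Music", 24: "Entertainment", 20: "Gaming"}

# Same steps as get_top_category_row: count per category, take the top n, map names
top = (toy.groupby("categoryId", as_index=False).size()
          .sort_values(by="size", ascending=False).head(2))
top["categoryName"] = top["categoryId"].map(cat_names)
print(top["categoryName"].tolist())  # ['Music', 'Entertainment']
```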
com_usca = top_df.iloc[0]
com_usca = pd.DataFrame(com_usca).rename(columns={0: "US"}).drop(['Country'])
com_usca['CA'] = top_df.iloc[4]
com_usca
com_gbfr = top_df.iloc[1]
com_gbfr = pd.DataFrame(com_gbfr).rename(columns={1: "GB"}).drop(['Country'])
com_gbfr['FR'] = top_df.iloc[5]
com_gbfr
com_krjp = top_df.iloc[6]
com_krjp = pd.DataFrame(com_krjp).rename(columns={6: "KR"}).drop(['Country'])
com_krjp['JP'] = top_df.iloc[8]
com_krjp
We try to analyze the different preferences for video categories based on economic status.
We've grouped the countries into two categories, Developing and Developed, based on their economic status. According to the World Population Review, several factors determine whether or not a country is developed, such as its political stability, gross domestic product (GDP), level of industrialization, social welfare programs, infrastructure, and the freedom its citizens enjoy.
# Set up necessary variables
country_start=0
country_end=7
#Generate numpy list which contains categories from developed and developing countries
developed_list=['US','GB','DE','CA','FR','KR','JP']
developing_list=['IN','RU','BR','MX']
developed = top_df[top_df['Country'].str.contains('|'.join(developed_list))].drop(columns='Country').to_numpy()
developing=top_df[top_df['Country'].str.contains('|'.join(developing_list))].drop(columns='Country').to_numpy()
# Count categories
developed_final=np.concatenate(developed[country_start:country_end])
developing_final=np.concatenate(developing[country_start:country_end])
developed_cate,developed_count=np.unique(developed_final,return_counts=True)
developing_cate,developing_count=np.unique(developing_final,return_counts=True)
# Generate the bar chart comparing the two groups
cate_name=developed_cate
fig = go.Figure(data=[
go.Bar(name='Developed countries', x=cate_name, y=developed_count/sum(developed_count)*100),
go.Bar(name='Developing countries', x=cate_name, y=developing_count/sum(developing_count)*100)
])
# Change the bar mode
fig.update_layout(barmode='group')
fig.show()
def get_channel(channel_name):
return yt.search().list(q=channel_name,type="channel",part="id,snippet").execute()
def get_videos(channel_id,part="id,snippet",limit=10):
playlist_id = yt.channels().list(id=channel_id,part="contentDetails").execute()["items"][0]["contentDetails"]["relatedPlaylists"]["uploads"]
videos=[]
next_page_token = None
while True:
res = yt.playlistItems().list(playlistId=playlist_id, part= part,maxResults=min(limit,50),pageToken=next_page_token).execute()
videos+=res["items"]
next_page_token = res.get("nextPageToken")
if next_page_token is None or len(videos) >=limit:
break
return videos
def parse_publish_timestamp(video,changehours=0,changeminutes=0):
return (dt.datetime.strptime(video["snippet"]["publishedAt"],"%Y-%m-%dT%H:%M:%SZ")+dt.timedelta(hours=changehours,minutes=changeminutes))
channel_Id_piewdiepiew = get_channel("pewdiepie")["items"][0]["id"]["channelId"]
channel_Id_tseries = get_channel("t-series")["items"][0]["id"]["channelId"]
videos_piewdie = get_videos(channel_Id_piewdiepiew,limit=50)
videos_tseries = get_videos(channel_Id_tseries,limit=50)
publish_timestamp_piewdie = [parse_publish_timestamp(video,changehours=1) for video in videos_piewdie]
publish_timestamp_t_series = [parse_publish_timestamp(video,changehours=5,changeminutes=30) for video in videos_tseries]
publish_time_only_piewdie = [t.hour + t.minute/60 for t in publish_timestamp_piewdie]
publish_time_only_t_series = [t.hour + t.minute/60 for t in publish_timestamp_t_series]
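The hour-plus-minute-fraction conversion used above can be checked on a sample timestamp:

```python
import datetime as dt

# A sample publish time: 19:30 should map to the decimal hour 19.5
sample = dt.datetime(2021, 11, 9, 19, 30)
decimal_hour = sample.hour + sample.minute / 60
print(decimal_hour)  # 19.5
```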
plt.figure(figsize=(15,10))
plt.subplot(1,2,1)
plt.hist(publish_time_only_piewdie,bins=30)
plt.title("Piewdiepie",fontsize=15)
plt.subplot(1,2,2)
plt.hist(publish_time_only_t_series,bins=30)
plt.title("T-Series",fontsize=15)
plt.suptitle('PewDiePie vs T-Series',fontsize=20)
plt.show()
With our analysis, potential YouTubers and content creators can gain insight into which factors to account for when aiming to make their videos trend.
Key takeaways from the Analysis:
All these factors are some of the key ingredients to make your delicious Pie(y)-Tube.